
    DC-Prophet: Predicting Catastrophic Machine Failures in DataCenters

    When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so that preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss or, even worse, reliability degradation of a datacenter. We further propose DC-Prophet, a two-stage framework based on a One-Class Support Vector Machine and a Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure, and an F3-score of 0.88 (out of 1). On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F3-score.
    Comment: 13 pages, 5 figures, accepted by 2017 ECML PKDD
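    As a rough illustration of the two-stage idea described above (a One-Class SVM flags anomalous machine behaviour, and a Random Forest then predicts whether a failure follows), the Python sketch below uses synthetic features and scikit-learn; the feature construction, hyperparameters, and data are illustrative assumptions, not those of DC-Prophet.

```python
# Minimal two-stage sketch (not the authors' implementation): a One-Class SVM
# flags anomalous per-machine feature windows, then a Random Forest predicts
# whether a flagged window precedes a failure. Features are hypothetical.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))            # per-machine window features (synthetic)
y = (rng.random(5000) < 0.05).astype(int)  # 1 = failure within the next window

# Stage 1: learn the "normal" regime from non-failure windows only.
ocsvm = OneClassSVM(nu=0.05, kernel="rbf").fit(X[y == 0])
anomaly_score = ocsvm.decision_function(X).reshape(-1, 1)

# Stage 2: Random Forest on the raw features plus the anomaly score.
X_stage2 = np.hstack([X, anomaly_score])
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
rf.fit(X_stage2[:4000], y[:4000])

pred = rf.predict(X_stage2[4000:])
print("F3-score:", fbeta_score(y[4000:], pred, beta=3))  # recall-weighted, as in the abstract
```

    Training the first stage only on non-failure windows lets its decision score act as an "unusualness" feature for the second-stage classifier, and the F3-score weights recall heavily, which matches the paper's emphasis on not missing failures.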

    Forecasting Player Behavioral Data and Simulating in-Game Events

    Understanding player behavior is fundamental in game data science. Video games evolve as players interact with the game, so being able to foresee player experience would help ensure successful game development. In particular, game developers need to evaluate beforehand the impact of in-game events. Simulation optimization of these events is crucial to increase player engagement and maximize monetization. We present an experimental analysis of several methods to forecast game-related variables, with two main aims: to obtain accurate predictions of in-app purchases and playtime in an operational production environment, and to perform simulations of in-game events in order to maximize sales and playtime. Our ultimate purpose is to take a step towards the data-driven development of games. The results suggest that, even though the performance of traditional approaches such as ARIMA is still better, the outcomes of state-of-the-art techniques like deep learning are promising. Deep learning emerges as a well-suited general model that could be used to forecast a variety of time series with different dynamic behaviors.
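    For context, the kind of classical baseline mentioned above can be set up in a few lines; the sketch below fits an ARIMA model to a synthetic daily revenue series with statsmodels (the order (2, 1, 2), the series, and the forecast horizon are assumptions, not the paper's configuration).

```python
# Illustrative ARIMA forecast of a daily in-game revenue series (synthetic data).
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
days = pd.date_range("2023-01-01", periods=365, freq="D")
# Weekly seasonality plus noise as a stand-in for daily in-app purchase revenue.
revenue = 100 + 20 * np.sin(2 * np.pi * np.arange(365) / 7) + rng.normal(0, 5, 365)
series = pd.Series(revenue, index=days)

model = ARIMA(series, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=14)   # two-week-ahead point forecast
print(forecast.head())
```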

    Population mortality during the outbreak of Severe Acute Respiratory Syndrome in Toronto

    Background: Extraordinary infection control measures limited access to medical care in the Greater Toronto Area during the 2003 Severe Acute Respiratory Syndrome (SARS) outbreak. The objective of this study was to determine if the period of these infection control measures was associated with changes in overall population mortality due to causes other than SARS.
    Methods: Observational study of death registry data, using Poisson regression and interrupted time-series analysis to examine all-cause mortality rates (excluding deaths due to SARS) before, during, and after the SARS outbreak. The population of Ontario was grouped into the Greater Toronto Area (N = 2.9 million) and the rest of Ontario (N = 9.3 million) based upon the level of restrictions on delivery of clinical services during the SARS outbreak.
    Results: There was no significant change in mortality in the Greater Toronto Area before, during, and after the period of the SARS outbreak in 2003 compared to the corresponding time periods in 2002 and 2001. The rate ratio for all-cause mortality during the SARS outbreak was 0.99 [95% Confidence Interval (CI) 0.93–1.06] compared to 2002 and 0.96 [95% CI 0.90–1.03] compared to 2001. An interrupted time series analysis found no significant change in mortality rates in the Greater Toronto Area associated with the period of the SARS outbreak.
    Conclusion: Limitations on access to medical services during the 2003 SARS outbreak in Toronto had no observable impact on short-term population mortality. Effects on morbidity and long-term mortality were not assessed. Efforts to contain future infectious disease outbreaks due to influenza or other agents must consider effects on access to essential health care services.
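    The core comparison can be sketched as a Poisson regression of weekly death counts on an outbreak-period indicator with a log-population offset, as below; the data, outbreak window, and covariates are synthetic illustrations, not the study's.

```python
# Poisson-regression sketch: weekly all-cause deaths (excl. SARS) with an indicator
# for the outbreak period and a log-population offset. Data are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
weeks = np.arange(104)                                  # two years of weekly counts
during = ((weeks >= 60) & (weeks < 72)).astype(int)     # hypothetical outbreak window
population = np.full(104, 2.9e6)
deaths = rng.poisson(lam=400, size=104)                 # synthetic weekly deaths

df = pd.DataFrame({"deaths": deaths, "during": during,
                   "week": weeks, "log_pop": np.log(population)})

# Rate ratio for the outbreak period, adjusted for a linear secular trend.
fit = smf.glm("deaths ~ during + week", data=df, offset=df["log_pop"],
              family=sm.families.Poisson()).fit()
print("Rate ratio (outbreak vs. non-outbreak):", np.exp(fit.params["during"]))
```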

    Modelling informative time points: an evolutionary process approach

    Real-world time series sometimes exhibit various types of "irregularities": missing observations, observations not collected regularly over time for practical reasons, observation times driven by the series itself, or outlying observations. However, the vast majority of methods of time series analysis are designed for regular time series only. A particular case of irregularly spaced time series is that in which the sampling procedure over time also depends on the observed values. In such situations, there is stochastic dependence between the process being modelled and the times of the observations. In this work, we propose a model in which the sampling design depends on the entire past history of the observed processes. Given the natural temporal order underlying data represented by a time series, a modelling approach based on evolutionary processes is a natural choice. We consider maximum likelihood estimation of the model parameters. Numerical studies with simulated and real data sets are performed to illustrate the benefits of this model-based approach.
    The authors acknowledge the Foundation FCT (Fundação para a Ciência e Tecnologia) as members of the research project PTDC/MAT-STA/28243/2017 and the Center for Research & Development in Mathematics and Applications of Aveiro University within project UID/MAT/04106/2019.
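    As a toy illustration of joint likelihood-based estimation when sampling is informative (not the authors' evolutionary-process model), the sketch below assumes AR(1) values and exponential waiting times whose rate depends on the last observed value, and maximizes the joint log-likelihood with scipy.

```python
# Toy model: x_{k+1} = phi * x_k + eps, and the gap to the next observation is
# Exp(rate = exp(a + b * x_k)), so observation times depend on the observed values.
# Parameters are fitted jointly by maximum likelihood. Purely illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, expon

rng = np.random.default_rng(3)
phi_true, sigma_true, a_true, b_true = 0.7, 1.0, -1.0, 0.5
x, gaps = [0.0], []
for _ in range(500):
    gaps.append(rng.exponential(1.0 / np.exp(a_true + b_true * x[-1])))
    x.append(phi_true * x[-1] + rng.normal(0, sigma_true))
x, gaps = np.array(x), np.array(gaps)

def negloglik(theta):
    phi, log_sigma, a, b = theta
    sigma = np.exp(log_sigma)
    ll_values = norm.logpdf(x[1:], loc=phi * x[:-1], scale=sigma).sum()
    rate = np.exp(a + b * x[:-1])
    ll_times = expon.logpdf(gaps, scale=1.0 / rate).sum()
    return -(ll_values + ll_times)

fit = minimize(negloglik, x0=np.zeros(4), method="Nelder-Mead")
print("phi, sigma, a, b =", fit.x[0], np.exp(fit.x[1]), fit.x[2], fit.x[3])
```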

    RNA Unwinding by NS3 Helicase: A Statistical Approach

    The study of double-stranded RNA unwinding by helicases is a problem of basic scientific interest. One such example is provided by studies on the hepatitis C virus (HCV) NS3 helicase using single-molecule mechanical experiments. HCV currently infects nearly 3% of the world population, and NS3 is a protein essential for viral genome replication. The objective of this study is to model the RNA unwinding mechanism based on previously published data and to study its characteristics and their dependence on force, ATP concentration, and NS3 protein concentration. In this work, RNA unwinding by NS3 helicase is hypothesized to occur in a series of discrete steps, with the steps themselves occurring in accordance with an underlying point process. A point-process-driven change point model is employed to model the RNA unwinding mechanism. The results are in broad agreement with findings from previous studies. A renewal process with gamma-distributed waiting times was found to model well the point process that drives the unwinding mechanism. The analysis suggests that the periods of constant extension observed during NS3 activity can indeed be classified into pauses and subpauses, and that each depends on the ATP concentration. The step size is independent of external factors and seems to have a median value of 11.37 base pairs. The steps themselves are composed of a number of substeps, with an average of about 4 substeps per step and an average substep size of about 3.7 base pairs. An interesting finding pertains to the stepping velocity: our analysis indicates that stepping velocity may be of two kinds, a low and a high velocity.
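    The renewal-process component can be illustrated by fitting a gamma distribution to the dwell times between unwinding steps, as in the short sketch below; the dwell times are synthetic and the shape and scale values are not those estimated in the study.

```python
# Sketch of the gamma renewal idea: maximum-likelihood fit of a gamma distribution
# to dwell times between steps (synthetic data; parameters are illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
dwell_times = rng.gamma(shape=4.0, scale=0.25, size=300)   # seconds between steps (synthetic)

# Fit with the location parameter fixed at zero.
shape, loc, scale = stats.gamma.fit(dwell_times, floc=0)
print(f"shape={shape:.2f}, scale={scale:.2f}, mean dwell={shape * scale:.2f} s")

# Under a gamma(k, theta) renewal process, each observed step is roughly the sum of
# k exponential sub-events, echoing the step/substep structure described above.
```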

    Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline

    From medical charts to national censuses, healthcare has traditionally operated under a paper-based paradigm. However, the past decade has marked a long and arduous transformation bringing healthcare into the digital age. Ranging from electronic health records, to digitized imaging and laboratory reports, to public health datasets, healthcare today generates an incredible amount of digital information. Such a wealth of data presents an exciting opportunity for integrated machine learning solutions to address problems across multiple facets of healthcare practice and administration. Unfortunately, the ability to derive accurate and informative insights requires more than the ability to execute machine learning models. Rather, a deeper understanding of the data on which the models are run is imperative for their success. While a significant effort has been undertaken to develop models able to process the volume of data obtained during the analysis of millions of digitized patient records, it is important to remember that volume represents only one aspect of the data. In fact, drawing on data from an increasingly diverse set of sources, healthcare data presents an incredibly complex set of attributes that must be accounted for throughout the machine learning pipeline. This chapter focuses on highlighting such challenges and is broken down into three distinct components, each representing a phase of the pipeline. We begin with attributes of the data accounted for during preprocessing, then move to considerations during model building, and end with challenges to the interpretation of model output. For each component, we present a discussion around data as it relates to the healthcare domain and offer insight into the challenges each may impose on the efficiency of machine learning techniques.
    Comment: Healthcare Informatics, Machine Learning, Knowledge Discovery; 20 pages, 1 figure

    Distributed Fine-Grained Traffic Speed Prediction for Large-Scale Transportation Networks based on Automatic LSTM Customization and Sharing

    Short-term traffic speed prediction has been an important research topic in the past decade, and many approaches have been introduced. However, providing fine-grained, accurate, and efficient traffic-speed prediction for large-scale transportation networks where numerous traffic detectors are deployed has not been well studied. In this paper, we propose DistPre, a distributed fine-grained traffic-speed prediction scheme for large-scale transportation networks. To achieve fine-grained and accurate traffic-speed prediction, DistPre customizes a Long Short-Term Memory (LSTM) model with an appropriate hyperparameter configuration for each detector. To make this customization process efficient and applicable to large-scale transportation networks, DistPre conducts LSTM customization on a cluster of computation nodes and allows any trained LSTM model to be shared between different detectors. If a detector observes a traffic pattern similar to another's, DistPre directly shares the existing LSTM model between the two detectors rather than customizing an LSTM model per detector. Experiments based on traffic data collected from freeway I5-N in California are conducted to evaluate the performance of DistPre. The results demonstrate that DistPre provides time-efficient LSTM customization and accurate fine-grained traffic-speed prediction for large-scale transportation networks.
    Comment: 14 pages, 7 figures, 2 tables, Euro-Par 2020 conference
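    The model-sharing idea can be sketched as follows: train an LSTM for one detector and reuse it for a second detector whose traffic pattern is sufficiently similar. The similarity test (Pearson correlation over aligned speed series), the 0.9 threshold, and the network architecture are assumptions for illustration, not DistPre's actual criteria.

```python
# Sketch: per-detector LSTM customization with model sharing between detectors
# that show similar traffic patterns. All data and thresholds are synthetic.
import numpy as np
import tensorflow as tf

def make_windows(series, lookback=12):
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X[..., None], y

def train_lstm(series, lookback=12):
    X, y = make_windows(series, lookback)
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(lookback, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=3, verbose=0)
    return model

rng = np.random.default_rng(5)
t = np.arange(288 * 7)                                 # a week of 5-minute speed samples
detector_a = 60 + 15 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 2, t.size)
detector_b = 62 + 15 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 2, t.size)

shared_models = {"A": train_lstm(detector_a)}

# Share the existing model if detector B's pattern correlates strongly with A's.
if np.corrcoef(detector_a, detector_b)[0, 1] > 0.9:
    model_b = shared_models["A"]          # reuse, no extra training
else:
    model_b = train_lstm(detector_b)      # customize a new LSTM for B
print("Model reused for B:", model_b is shared_models["A"])
```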

    Observed Reductions in Schistosoma mansoni Transmission from Large-Scale Administration of Praziquantel in Uganda: A Mathematical Modelling Study

    To date, schistosomiasis control programmes based on chemotherapy have largely aimed at controlling morbidity in treated individuals rather than at suppressing transmission. In this study, a mathematical modelling approach was used to estimate reductions in the rate of Schistosoma mansoni reinfection following annual mass drug administration (MDA) with praziquantel in Uganda over four years (2003-2006). In doing so, we aim to elucidate the benefits of MDA in reducing community transmission. Age-structured models were fitted to a longitudinal cohort followed up across successive rounds of annual treatment for four years (baseline: 2003; treatment: 2004-2006; n = 1,764). Instead of modelling contamination, infection, and immunity processes separately, these functions were combined in order to estimate a composite force of infection (FOI), i.e., the rate of parasite acquisition by hosts. MDA achieved substantial and statistically significant reductions in the FOI following one round of treatment in areas of low baseline infection intensity, and following two rounds in areas with high and medium intensities. In all areas, the FOI remained suppressed following a third round of treatment. This study represents one of the first attempts to monitor reductions in the FOI within a large-scale MDA schistosomiasis morbidity control programme in sub-Saharan Africa. The results indicate that the Schistosomiasis Control Initiative, as a model for other MDA programmes, is likely exerting a significant ancillary impact on reducing transmission within the community, and may provide health benefits to those who do not receive treatment. The results obtained will have implications for evaluating the cost-effectiveness of schistosomiasis control programmes and for the design of monitoring and evaluation approaches in general.
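    A much-simplified way to estimate a force of infection from cross-sectional data is the catalytic model P(a) = 1 - exp(-lambda * a); the sketch below fits lambda by maximum likelihood on synthetic age-prevalence data. This illustrates the FOI concept only and is not the age-structured reinfection model used in the study.

```python
# Toy catalytic-model FOI estimate: probability of ever having been infected by age a
# is 1 - exp(-lambda * a). Data are synthetic; lambda is fitted by maximum likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
ages = rng.integers(5, 60, size=1000)
lam_true = 0.08                                          # per-year FOI (synthetic)
infected = rng.random(1000) < (1 - np.exp(-lam_true * ages))

def negloglik(lam):
    p = np.clip(1 - np.exp(-lam * ages), 1e-9, 1 - 1e-9)
    return -np.where(infected, np.log(p), np.log(1 - p)).sum()

fit = minimize_scalar(negloglik, bounds=(1e-4, 1.0), method="bounded")
print("Estimated FOI (per year):", fit.x)

# Refitting on post-treatment reinfection data and comparing the two estimates gives
# the kind of FOI reduction reported above.
```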

    Time series modeling for syndromic surveillance

    BACKGROUND: Emergency department (ED) based syndromic surveillance systems identify abnormally high visit rates that may be an early signal of a bioterrorist attack. For example, an anthrax outbreak might first be detectable as an unusual increase in the number of patients reporting to the ED with respiratory symptoms. Reliably identifying these abnormal visit patterns requires a good understanding of the normal patterns of healthcare usage. Unfortunately, systematic methods for determining the expected number of ED visits on a particular day have not yet been well established. We present here a generalized methodology for developing models of expected ED visit rates. METHODS: Using time-series methods, we developed robust models of ED utilization for the purpose of defining expected visit rates. The models were based on nearly a decade of historical data at a major metropolitan academic, tertiary-care pediatric emergency department. The historical data were fit using trimmed-mean seasonal models, and additional models were fit with autoregressive integrated moving average (ARIMA) residuals to account for recent trends in the data. The detection capabilities of the model were tested with simulated outbreaks. RESULTS: Models were built both for overall visits and for respiratory-related visits, classified according to the chief complaint recorded at the beginning of each visit. The mean absolute percentage error of the ARIMA models was 9.37% for overall visits and 27.54% for respiratory visits. A simple detection system based on the ARIMA model of overall visits was able to detect 7-day-long simulated outbreaks of 30 visits per day with 100% sensitivity and 97% specificity. Sensitivity decreased as outbreak size decreased, dropping to 94% for outbreaks of 20 visits per day and 57% for 10 visits per day, all while maintaining the 97% benchmark specificity. CONCLUSIONS: Time series methods applied to historical ED utilization data are an important tool for syndromic surveillance. Accurate forecasting of total emergency department utilization, as well as the rates of particular syndromes, is possible. The multiple models in the system account for both long-term and recent trends, and an integrated alarm strategy combining these two perspectives may provide a more complete picture to public health authorities. The systematic methodology described here can be generalized to other healthcare settings to develop automated surveillance systems capable of detecting anomalies in disease patterns and healthcare utilization.
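    A minimal version of such an ARIMA-based alarm is sketched below: fit the model to historical daily visit counts, then flag days whose observed count exceeds the upper limit of the one-step prediction interval. The ARIMA order, interval width, and data (including the simulated 30-visit-per-day outbreak) are illustrative assumptions, not the paper's configuration.

```python
# ARIMA-based outbreak alarm on synthetic daily ED visit counts.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
days = pd.date_range("2022-01-01", periods=730, freq="D")
baseline = 150 + 25 * np.sin(2 * np.pi * np.arange(730) / 365)  # seasonal baseline
visits = rng.poisson(baseline)
visits[-7:] += 30                              # simulated 7-day outbreak of +30 visits/day

train, test = visits[:-14], visits[-14:]
model = ARIMA(pd.Series(train, index=days[:-14]), order=(2, 1, 1)).fit()

forecast = model.get_forecast(steps=14)
upper = forecast.conf_int(alpha=0.05).iloc[:, 1].to_numpy()     # 97.5% upper limit
alarms = test > upper
print("Alarm days:", list(days[-14:][alarms].date))
```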